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STOCHASTIC EXPANSIONS USING CONTINUOUS 
DICTIONARIES: LEVY ADAPTIVE REGRESSION KERNELS 

By Robert L. Wolpert^, Merlise A. Clyde^ and Chong Tu 

Duke University, Duke University and PIMCO 

This article describes a new class of prior distributions for non- 
parametric function estimation. The unknown function is modeled as 
a limit of weighted sums of kernels or generator functions indexed 
by continuous parameters that control local and global features such 
as their translation, dilation, modulation and shape. Levy random 
fields and their stochastic integrals are employed to induce prior dis- 
tributions for the unknown functions or, equivalently, for the number 
of kernels and for the parameters governing their features. Scaling, 
shape, and other features of the generating functions are location- 
specific to allow quite different function properties in different parts 
of the space, as with wavelet bases and other methods employing 
overcomplete dictionaries. We provide conditions under which the 
stochastic expansions converge in specified Besov or Sobolev norms. 
Under a Gaussian error model, this may be viewed as a sparse re- 
gression problem, with regularization induced via the Levy random 
field prior distribution. Posterior inference for the unknown functions 
is based on a reversible jump Markov chain Monte Carlo algorithm. 
We compare the Levy Adaptive Regression Kernel (LARK) method 
to wavelet-based methods using some of the standard test functions, 
and illustrate its flexibility and adaptability in nonstationary appli- 
cations. 

1. Introduction. Popular approaches for nonparametric Bayesian esti- 
mation of unobserved functions generally employ as prior distributions ei- 
ther Gaussian processes (or random fields, in two or more dimensions) or 
mixtures of Dirichlet processes. In this article, we focus attention on a wider 
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class of processes, Levy random fields and their stochastic integrals. These 
include Gaussian random fields as a limiting case, while Dirichlet processes 
may be represented as "normalized" variants of the Gamma Levy random 
field; Levy random fields thus provide an important link between two of 
the random processes that form the foundation of Bayesian nonparametric 
methods (see Section 6). In this article, we construct prior distributions for 
the mean function in nonparametric regression as stochastic integrals of Levy 
random fields. Under suitable regularity, these can be expressed as stochas- 
tic expansions using continuous dictionaries, permitting tractable Bayesian 
inference. While our focus is on nonparametric regression, we hope that the 
reader will see the possibilities of using Levy random fields in other contexts. 

To begin, suppose we have noisy measurements {lijjg/ of an unknown 
real-valued function / : — )• M observed at points {xjjjg/ in some complete 
separable metric space X, with E[li] = f(xi). In nonparametric regression 
models, the mean function /(•) is often regarded as an element of some 
Hilbert space 7i of real- valued functions on X, and is expressed as a linear 
combination of basis functions {gj } CTi: 



with some (finite or infinite) number J of unknown coefficients {/3j}o<j<j- 
There is a vast literature on classical and Bayesian approaches for estimat- 
ing / from noisy data using such methods as regression splines, Fourier 
expansions, wavelet expansions, and kernel methods, including kernel re- 
gression and support (or relevance) vector machines [see Chu and Mar- 
ron (1991), Cristianini and Shawe-Taylor (2000), Denison et al. (2002), Vi- 
dakovic (1999), Wahba (1992), for background and references]. Many ap- 
proaches, including smoothing splines and support vector machines, use 
as many basis elements, J, as there are data points, n = |/|, but employ 
regularization to avoid over-fitting. Sparser solutions (using fewer basis el- 
ements, J <^ n) may be obtained through more stringent regularization 
penalties, as in the Lasso [Tibshirani (1996)] and Dantzig Selector [Candes 
and Tao (2007)] approaches, or (often equivalently) in Bayesian methods 
through choice of prior distributions, as in relevance vector machines [Tip- 
ping (2001)]. Sparse solutions may also be achieved by using variable selec- 
tion techniques to choose a few well-placed basis functions, perhaps in con- 
junction with regularization [Chen, Donoho and Saunders (1998), Denison, 
Malhck and Smith (1998), DiMatteo, Genovese and Kass (2001), MaUat and 
Zhang (1993), Johnstone and Silverman (2005b), Smith and Kohn (1996), 
Wolfe, Godsih and Ng (2004)]. 

In most signal processing and other nonstationary applications, no single 
(especially orthonormal) basis will lead to a sparse representation [Donoho 
and Elad (2003), Wolfe, Godsill and Ng (2004)]. Overcomplete dictionaries 
and frames [Daubechies (1992), Mallat and Zhang (1993)] provide larger 
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collections of generating elements {gu)}u)en than would a single basis for 7i, 
potentially allowing for more effective signal extraction and data compres- 
sion. Examples of overcomplete dictionaries include unions of bases, Ga- 
bor frames, nondecimated or translational invariant wavelets, wavelet pack- 
ets, or more general kernel functions or generating functions g{x,u}) where 
u £0, controls features (local or global) of the generating function, such as 
translations, dilations, modulations and shapes. Because of the redundancy 
inherent in overcomplete representations, coefficients for expansions using 
overcomplete dictionaries are not uniquely determined. This lack of unique- 
ness is advantageous, permitting more parsimonious representations from 
the dictionary than those obtained using any single basis. 

In this article, we develop a fully Bayesian method for the sparse re- 
gression problem using stochastic expansions [Abramovich, Sapatinas and 
Silverman (2000)] of continuous dictionaries. We begin in Section 2 by in- 
troducing Levy random fields, which are used to induce prior distributions 
for f £T-L through stochastic integration of a kernel function with respect to 
a signed infinitely divisible random measure. We call the new model class 
Levy Adaptive Regression Kernel or "LARK" models. The LARK frame- 
work allows both the number of kernels and kernel-specific parameters to 
adapt to any nonstationary features of /. Both finite and infinite expansions 
are considered. Exploiting the construction of Levy random fields through 
Poisson random fields, we develop finite approximations to infinite expan- 
sions in Section 3 that permit tractable inference. In Section 4, we provide 
conditions under which the functions are almost surely in the same function 
space as the generating kernel. We describe the hierarchical representations 
of LARK models in Section 5 that enable posterior inference for the LARK 
model using reversible jump Markov chain Monte Carlo (RJ-MCMC) meth- 
ods. In Section 6, we discuss relationships among LARK and other popu- 
lar parametric and nonpar ametric methods. We then compare our LARK 
method to other procedures using simulated data in Section 7 and real data 
in Section 8. In Section 9, we discuss possible extensions of the LARK model. 

2. Stochastic expansions and prior distributions. To make inference about 
the unknown mean function f £% given noisy observations Yi of f{xi) for 
{xi} C we must first propose a prior distribution on % for /. Let be 
a complete separable metric space and (/> : x — )• M a Borel measurable 
function, and set (j)j{xi) = (j){xi,ujj) for some collection {ujj} C 17. As a slight 
extension of the basis expansion of (1), set 



0<j<J 

for a random number J < oo of randomly drawn pairs (f3j,ujj) G M x 0. 
This is equivalent to specifying a random signed Borel measure C{duj) = 



(2) 




4 



R. L. WOLPERT, M. A. CLYDE AND C. TU 



f!^j^uij{duj) on il, giving the equivalent representation: 

(3) fix)= [ cl){x,oj)C{doj). 

Jn 

The task of assigning prior distributions to functions /(•) of the form (2) 
is equivalent to that of specifying prior distributions for the random mea- 
sure C[duj) in (3), that is, to specifying consistent joint probability distri- 
butions for all random vectors of the form {C{Ai) , . . . , C{Ak)) for disjoint 
Borel sets Ai C fi. Levy random measures, those for which {C{Ai)} are in- 
dependent for disjoint {Aj}, are ideal for this purpose, since (as we will see 
in Section 5.3) they are simple to construct and amenable to posterior sim- 
ulation. To make ideas more concrete, we first describe possible choices for 
the generating functions (j)[x^Lo) used in our stochastic expansions and then 
proceed with the presentation of Levy random measures in Section 2.2. 

2.1. Generating functions. Possible choices for w) for = M include 
translation-invariant kernel functions, such as the Gaussian 

(4a) (l)G{x,uj) = eyi^{-\\{x-xf} 

or the Laplace 

(4b) (t)L{x,uj) = eyi^{-\\x-x\} 

kernels with oj = (x, A) G <^ x M"^ = Vt. There is no need to restrict attention 
to symmetric (e.g.. Mercer) kernels, as required in the conventional Support 
Vector Machine (SVM) approach [Law and Kwok (2001), Solhch (2002)]. 
Asymmetric kernels, such as the one-sided exponential 

(4c) 4>E{x,uj) = eic£,{-\{x-x)]'^{x>x} 

are useful, for example, in modeling pollutant dissipation over time. Other 

possibilities include piecewise-constant Haar wavelets on = (0, 1] , 

(4d) (t>H{x,uj) = l{o<,\{x-x)<i} 

or continuous rescaling and shifting of other wavelet functions 

(4e) (l,^{x,uj) = \ymX{x-x))- 

In each of these examples, Q is a location-scale space with location parame- 
ter X s-iid parameter A determining the scale. Higher-dimensional spaces X 
may be accommodated in a similar way; for example, in Section 8.2 we use 
space-time kernel 

(4f) ^^t{x,u:) = exp{-i(s - a)' K{s -a)- \\t - r|} 

for space-time point x = (s, t) £ x M_(_; here uj = (a, r. A, A) includes a spa- 
ce-time point (cT, r) G x ]R_|_, a positive-definite spatial dispersion matrix 
A G S2 , and a temporal decay rate A S M+. 
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2.2. Levy random measures. For any i/^ >0 and any probability distribu- 
tion 7r{d/3 duj) on M X ri, let J~ Po{v^) be Poisson-distributed with mean , 

and let {(/3j,a;j)}o<j<j iT{dl3 du); then the random measure given by 

(5) C{A)^ Yl ^A{co,)f^j 

0<j<J 

assigns independent infinitely-divisible (henceforth "ID") random variables 
C{Ai) to disjoint Borel sets Ai cil., with characteristic functions 

(6) E[e^*^(^)] = expj / / {e''^ - l)u{dpduj)] 

[J JmxA J 

with v{dl3duj) = v^T:{dl3 duj). More generally, the "Levy measure" ^{djiduj) 
need not be finite for the random measure C to be well defined, so long as 
the integral in (6) converges for all i £ M; since the integrand is bounded on 
all of M X n and is of order 0(/3) near /3 ~ 0, this will hold for any measure 
that satisfies the local Li integrability condition 



(7) / / {I A\l3\)v{d(3 duo) < oo 

J JrxK 

for each compact K ci^. The mean and variance, when they exist, are given 
by E[CiA)]=ff^^^l3i,{d(3du;) and VsriCiA)] = ff^^^/S^d/S du;), respec- 
tively. 

Khinchine and Levy (1936) showed that the most general ID random vari- 
ables [and hence the most general ID- valued random measures; see Rajput 
and Rosihski (1989), Proposition 2.1] have characteristic functions of the 
form 



(8) 



E[e^*^(^)] = exp<J it6{A) - ^t^J:{A) 



+ [ [ {e''^ -l-itho{P)Hdpduj)], 
J JrxA J 



where /io(/3) = /31[-i,i](/5), determined uniquely by the characteristic triplet 
of sigma-finite measures (5, S,i/) consisting of a signed measure 6{doj) and 
a positive measure Ti{dijj) on and a positive measure v{d(3duj) on M x 
that satisfies the local L2 integrability condition 



(9) / / {I A (3')v{dl3 duj) < 00 

J JrxK 

for each compact K C Q and ^({O},^}) = (for more details on this non- 
stationary version of the classic Levy-Khinchine formula see Jacod and 
Shiryaev [(1987), page 75], Cont and Tankov [(2004), pages 457-459] or 
Wolpert and Taqqu (2005)). 
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The role of the compensator function /io(/3) is to make the last integrand 
in (8) bounded and 0(/3^) near /5 ~ 0, permitting the replacement of (7) with 
the weaker condition (9); in this case C{du]) may have countably-many points 
of support {wj} C whose magnitudes {/3j} are not absolutely summable, 
precluding a representation of the form (5). The compensator /io(/3) may be 
replaced by any bounded measurable function satisfying 

(10) /i(/3)=/3 + 0(/32), /3«0, 

with a corresponding replacement of 5{dLij) with 6h{duj) = 5{doj) + J^[h{l3) — 
hQ{(3)]v{dl3 dw). Whenever (7) is satisfied, we may take h{l3) =0 with the 
same adjustment to Sq. 

By (8) the random measure C may be written as the sum of two indepen- 
dent parts: a Gaussian portion, assigning independent normally-distributed 
random variables with mean 5h{Ai) and variance S(^j) to disjoint sets Ai, 
and the remaining portion, with characteristic function 

(11) E[e^*^(^)] = exp| / / {e''^ -l-ith{P))v{dl3duj)\. 

{.J JrxA J 

We call a random signed measure C with no Gaussian component [i.e., an 
ID-valued measure with S(r2) = 5h{^) = 0, that satisfies (11)] a Levy ran- 
dom measure. Nonnegative Levy random measures satisfying (7) were called 
"completely random measures" by Kingman (1967). 

2.3. Levy random fields. A Levy random measure C satisfying (11) in- 
duces a linear mapping (pt-^ C[(j)] from functions (/> : O — )• M to random vari- 
ables £[(/)] = J^(j){u)C{duj); such a mapping is called a random field. For 
simple functions 4>{uj) = J^o-i^Aii^) with each Ai C compact, we set 
C[(p] = ^ aiC{Ai) and verify that 

(12) E[e^*^[<^l] = expj / / (e^*<^(^)^ - 1 - it^{uj)h{P))u{dp dco)] . 

[J JRxn J 

It is straightforward to extend this by continuity in probability to (at least) 
all bounded measurable compactly-supported <^ : — ?• M. We now present 
a general construction based on Poisson random fields, the key to our ap- 
proach to tractable posterior Bayesian inference. 

2.3.1. Poisson construction L Uncompensated. When ^{dfiduj) satisfies 
(7) (i.e., |/3| is locally z/-integrable at zero) we may take /i(/3) = in (12) and 
construct C as follows. Begin with a Poisson random measure J\f{dj3duj) ~ 
Po(z^) on (M X 0) that assigns independent Poisson-distributed random vari- 
ables N{Ci) ~ Po{u{Ci)) with means v{Ci) to disjoint Borel sets Cj C (M x 17). 
For any Borel set Ad^ with compact closure A and bounded measurable 
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compactly-supported : — M, set J = A/'(M x A) and 




where {{j3j,ujj)} is the (random) set of J < oo support points of J\f{dl3du]). 
The integrals and sums in (12), (13) are well defined for all (j) for which 



which by (7) includes all bounded measurable compactly-supported functions. 

For any Borel sets Ad^ and i? C M, the Poisson measure N assigns to 
the set -B X ^ C M X O the number M{B x A) of £'s support points ujj G A 
with mass of sizes f3j & B. By (7) this is necessarily finite if A has compact 
closure and B is bounded away from zero, but if ^{M x 17) = oo then C will 
have J = oo support points in altogether with (almost surely) absolutely 
summable magnitudes X^o<i<j{l/^il -^i ^ ^} < oo. 

2.3.2. Poisson construction II: Compensated. The situation is more deli- 
cate in case the Levy measure does not satisfy (7), but only the weaker bound 
in (9) (i.e., if is locally z^-integrable but |/3| is not). Begin again with the 
Poisson measure M ~ Po(;/) on M x fi, and introduce the compensated or 
centered Poisson measure M{dj3 doo) = M{df3 du) — u{df3 dco) with mean zero 
[Sato (1999), page 38], inducing an isometry from L2{R x Q,,u[d(3 dw)) to 
the square-integrable zero-mean random variables. Following Wolpert and 
Taqqu (2005), set 



for any measurable (j) for which (14) converges. If (7) holds, one may sim- 
plify (14) to 





(14) 





(15a) 




(15b) 



0<j<J 
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showing that the role of the compensator is to add an /i-dependent "drift" (or 
offset, in higher dimensions) term 6h[(j)] = — J /uxn h{j3)(f){uj)v{dj3 du) to (13). 
When (7) fails, however, both the uncompensated sum and dh[(j)] in (15) will 
be infinite, while the representation of (14) remains valid under the following 
conditions. 

Theorem 1. Let u he a Levy measure onM x 17 satisfying (9). Then C[(j)\ 
is well defined by (14) with characteristic function given by (12) for com- 
pensator ho{l3) = /31{|^|<i} if (j) satisfies 



(16a) // {1 A\f3(j){u})\)i^{dpdu}) <oo, 

(16b) [[ (|/3(/.(l^)| A|/3(/.(w)|2)zy(d/3(ia;) <oo. 

J J[~i,i]xn 

If, in addition, (p satisfies 



(16c) // (1 A/?^)|(/)(w)|zv(d/3dw) <oo, 

then C[(j)] is well defined for any compensator h{f3) satisfying (10). 

Proof. Under these conditions, the integrands of the compensated and 
uncompensated Poisson integrals in (14) are in the Musielak-Orlicz spaces 
for which those integrals are well defined; see Rajput and Rosiiiski [(1989), 
page 9], Kwapieii and Woyczyhski (1992). □ 

In particular: 

Corollary 1. i^[4>] is well defined with characteristic function (12) for 
any function (j) satisfying 



(17) // {I A I3^){\(p{uj)\y (t)\uj))v{d(3duj) <oo, 

including [by (9)] all bounded measurable compactly- supported (j). Thus, 
C{A) = vC[lyi] is always well defined for any Borel set AdO. with compact 
closure A. 

Similarly: 

Proposition 1. For a Levy measure v satisfying (7), take h{f3) = 0; then 

(13) m= [ [ PH^)Ar{dpdoj)= (i,{ojj)f3j 

J JRxn n^„-^ T 



[with J = M{M. X 0) < c«y is well defined with characteristic function (12) 
for any (p satisfying 

(18) [[ {lA\P^{uj)\)iy{dpduj)<oo. 

J JRxn 
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2.4. Constructing Levy kernel integrals. Denote by $ the linear space of 
functions (/>: $7 — )• M for which C[(J)] has been defined; we have seen that this 
includes at least all bounded measurable compactly-supported functions (p. 
Denote by G the linear space of measurable functions g:X^^, and simplify 
notation by writing "g(x,a;)" for g[x){uj). Each of the generating functions 
introduced in (4) lies in Q. For any g &G,we can construct a random function 

(19) f{x)^C[g{x)] 

g{x,uj)[/3 -h{(])]Af{d/3du) 



g{x,co)h{(3)M{df3dLo) 

ixn 

= 5^ gix,u;,)[f3,-hi(3,)] 

0<j<J 

+ 11 g{x,uj)h{P)M{dpduj) or 

(20) = g{x,L0j)l3j if (7) holds so compensation is unneeded. 

o<i<J 

Integer moments of f{x) are easy to compute, when they exist, from the 
characteristic function given in (12), for example: 

(21a) E{f{x)}=[[ ^{x,uj)[f3-h{f3)]u{dfidco), 

(21b) C0V{/(X1),/(X2)}= // ^{xi,L0)^{x2,Lo)^^u{dpdLo). 

2.5. Examples of Levy measures. We now consider some specific exam- 
ples of Levy random fields and the corresponding kernel integrals. Familiar 
examples include Poisson, Gamma, Cauchy and more generally a-Stable 
random fields. 

2.5.1. Compound Poisson processes. The simplest model to consider would 
be that of (2), with finite Levy measure satisfying i/"*" = i/(M x Q) < oo, re- 
produced here: 

(2) /(x)= Yl 't>{x,oj,)Pj. 

0<j<J 

This has a Poisson-distributed number J ~ Po{i'~^) of terms whose loca- 
tions LOj and magnitudes f5j are i.i.d. with an arbitrary distribution {I3j,ujj} 
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7r{df3du}), hence Levy measure of the form v[df3du) = v'^'K{dl3 dto). The 
marginal distribution of f{x) at each x & X \s compound Poisson. 

2.5.2. Gamma random fields. The Levy measure for the Gamma random 
field is infinite but satisfies the strong local Li integrability condition (7), 
obviating compensation; in the homogeneous case, it is 

(22) i^{d/3duj) = p-^e-'^'^l{^yoydp-f{duj) 

for some cr-finite measure ^{dw) on il, giving C{A) ~ Ga(7(A),r7) [with 
mean ^{A)/rj\ for Borel measurable Ad^ with ^{A) < oo. Because is 
concentrated on M4., the mass /3j at each of the Gamma random mea- 
sure's support points LOj is positive, so all the coefficients in the expres- 
sion /(x) = ^(j){x,ujj)l3j are nonnegative. With a nonnegative generating 
function 4> G G, this provides a direct way to construct nonnegative mean 
functions / > without having to transform the responses {1^} as Gaussian 
methods would require. The mean E[/(x)] = r]~^ J g{x,uj)j{duj) is available 
from (21a), as is the covariance from (21b). 

2.5.3. Symmetric Gamma random fields. A symmetric analogue of the 
Gamma random field (22) has Levy measure 

(23) i^{dl3duj) = \^\-^e-^'^^''d(3j{duj) 

on all of M X ri, leading to random variables C{A) distributed as the differ- 
ence of two independent Ga('j{A),r]) variables, with characteristic function 
^jgrt£(A)j _ ^-|^ _|_ ^2y^2^-7{A)_ gg^}^ ^}^g standard positive Gamma random 

measure and this symmetric version satisfy the local Li bound (7), hence no 
compensation is required so we may take h{/3) = and employ the simple 
construction (20) of f{x). The mean E[/(x)] = vanishes for the symmetric 
Gamma random field, or for any other Levy random field with a symmetric 
(in ib/3) Levy measure satisfying (7). Covariances are available from (21b). 
Nearly all of the commonly used isotropic geostatistical covariance functions 
[see Chiles and Delfiner (1999), Section 2.5] may be achieved by the choice 
of a suitable generating kernel g{x, •) and Levy measure z/((i/3 du); see Clyde 
and Wolpert (2007) for specific examples. 

2.5.4. Symmetric a-Stable random fields. Symmetric a-Stable (SaS) Le- 
vy random fields have Levy measure 

(24) u{d^duj) = Caa\(3\-^-''d(3j{duj) 

on M X 57 for some < a < 2 and cr-finite positive measure j^du), where 
Cq, = (l/7r)r(a) sin(7ra/2), giving C{A) ~ St(a, 0, 7(A), 0) [in parametriza- 
tion (M) of Zolotarev (1986), page 11] with infinite variance (and thus no 
meaningful covariance function for /(x) = C[g{x)]). This infinite Levy mea- 
sure satisfies (9) for all < a < 2, but satisfies the stronger local Li con- 
dition (7) only for < a < 1; thus compensation is required to construct 
SaS random fields with 1 < a < 2, including the Cauchy case of a = 1. One 
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can show that f{x) is well defined for any 4>{x,-) G La{^,^{du))), including 
the generating functions of (4). The SaS fields have heavier tails than, for 
example, the symmetric Gamma fields of Section 2.5.3, and may be more 
appropriate for problems where one might expect /(•) to include by a few 
heavily weighted kernels. 

3. Approximations for implementing kernel integrals. Computer simula- 
tions of Levy random measures A i— )■ C{A) and random fields (f> i— )• C[(f)] asso- 
ciated with finite Levy measures v may be constructed as in (5), (13), simply 

by setting i/+ = x Q) and drawing J ~ Po(zv+) and {{pj,ujj)}o<j^j 
7r{d/3duj) = u[dl3 duj) /v~^ . If z^(]R xQ) = oo however the sums in these equa- 
tions will include countably infinitely-many terms, and may not be abso- 
lutely summable. We now construct an approximating set of finite Levy 
measures {ve} indexed by e > and show that the approximate Levy ran- 
dom fields >Ce[(/)] converge to the random field C[(j)\ given in (14). Note that e 
is not a model parameter. It is only a device used for two purposes: as a tool 
in the theorems constructing LARK models (in this section) and establish- 
ing their properties (in Section 4), and to enable the construction of practi- 
cal numerical methods to approximate LARK models within specified error 
bounds (in Section 5). 

Theorem 2. Let v he a Levy measure defined on M x 17 satisfying (9) 
and G $ satisfying (16). Take {i^e} to he any family of compact sets in- 
creasing to Q as e — )• 0, and for any Borel sets Ad^ and S C M and let 
he the unique Borel measure on M x satisfying 



for S C M, A C ^ [note vf = Vf^iM. x Q) < oo]. Let h{-) be any bounded 
measurable compensator function on M satisfying h{fi) = (3 + 0{I3'^) for (3 
near zero. Then as e — )■ 0, the random variables 



converge in probability to £.[(p] of (14)- 

Proof. The error in approximating C[(p] of (14) by Cs[(j)] of (26) is 



(25) 



i^e{B xA) = u{{Bn[-e,ef) x (AnKe)) 



(26) 





(27) 
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where = {{/3,u}) : |/5| < e or w G K^}. The first term in (27) converges to 
zero almost surely, and the second in Li, as e — )• 0; see the Appendix for 
details. □ 



The approximation Cs[(j)] is the sum of a Levy random field with finite 
Levy measure [hence with simple representation (13)] and a deterministic 
drift term Ss[(j)] given by the second integral in (26). The drift vanishes 
whenever z/(d/3 dui) is symmetric in zb/3 and h{(3) is odd. 



Corollary 2. If either (a) v{dj3(Lj) satisfies (7), or (b) u{d/3duj) sat- 
isfies (9) and is even in ib/3, and also h{p) is an odd function, then for each 

(28) fe{x)= 9{x,io,)Pj 

0<j<Je 

with 

Je ~ Po(l/+), {Pj,^j}o<j<Je I Je l^e{dP duj) / uf 

converges to f(x) in probability as e — )■ 0. 
Proof. With /,(x) = ^.[^(x)], 
feix) = / g{x,uj)Ce{dhj) 



(29) 

= / / g{x,Lo)pMe{dl3du:)- [ [ g{x,Lo)h{P)u,{d^ du) 

with M£{df3 du) ~ Po{i'£{d/3 du)). If u satisfies (7), then without loss of gen- 
erality take the compensator function h{(3) = 0. In both cases (a) and (b), 
the second integral in (29) vanishes, leading to (28) [cf. (2)]. □ 



Note that in case (b) the {g{x,ujj)f3j} are not absolutely summable so 
"^°^Q5f(x, cijj)/?/' does not converge in the Lebesgue sense. In each of our 
applications the conditions of Corollary 2 hold, allowing us to approximate z/ 
by a finite Levy measure Vs [sind £ by jC^ ~ Levy(z^£)], and exploit the re- 
sulting Poisson representation for inference. 



4. Function spaces for LARK models. Theorem 2 and Corollary 2 estab- 
lish pointwise convergence of fe{x) to f{x) as e — >• 0; in this section we pro- 
vide conditions to ensure that /e(-) ~^ /(') ™ appropriate Besov or Sobolev 
norms if the generating functions lie in the same space. 
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For s > and d G N denote by W|(M ) the Sobolev space of real- valued 
square-integrable functions /(•) G L2{W^) [Sobolev (1991), Section 1.7, Reed 
and Simon (1975), page 50] with finite Sobolev norm 

(30) ||/||„. = {^/_;i + |ai/«)P^}"' 

with Fourier transforms defined for / G Li(M'^) by 

fiO= [ e^^--f{x)dx 

and by L2 limits for / G ^2(1^"^); here and dx denote the Lebesgue volume 
element in M*^, and ^ • x denotes the Euclidean inner product. Each is 
a Banach space, hence complete. By Plancherel's theorem, each / G W| with 
s > has s distributional derivatives in L2(M), and by Sobolev's lemma has k 
continuous derivatives for each integer < k < s — d/2. 

Besov spaces constitute a flexible family that includes elements with wide 
spatial irregularity. The Besov space Mp^ consists of those / G Lp{M.'^) whose 
Besov semi-norms are finite. Several equivalent Besov semi-norms appear in 
the literature [Triebel (1992), Theorem 2.6.1, page 140]; we use the definition 
given as equation 2 of that theorem. For p,q>0 and s > d{l/p — 1)+ and 
for any integer m> s {m = 1 -|- [sj is easiest), set 

\f\u={l \hr'\\Atf\\idh/\h\'' 
\j\h\<i 

or, in dimension d=l, 

(31) \f\;^=(^2j\~'-^%^^f\\idhj'\ 

where A™ denotes the mth forward finite difference, 
AO/(x) = /(x), 

(32) A^f{x) = [A^-'f{x + h)- A™-V(x)] 

The Besov space B^^ is the Banach space completion of ^^(IR'^) under norm 

(33) 11/11;, = ll/llp + 1/1;,. 

For p = q = 2, M^^ coincides with the Sobolev space Wl. 

For fixed u Cl, each of the kernel functions g{-,uj) in (4) is in B^g for 
all p,q>l and some s > 0, and hence each finite approximation of the form 
(28) lies in the same Bp^. For example, the Gaussian kernel of (4a) (along 
with its (i-dimensional generalization) satisfies gc{-,uj) G B^ for every s < 00 



14 



R. L. WOLPERT, M. A. CLYDE AND C. TU 



and p, (7 > 1, while in the double-sided Laplace kernel of (4b) satisfies 
gL{-,(jj) G for s < 1 + 1/p < 2 for integer p and the Haar wavelet of (4d) 
is in only for s < 1/p. To simplify proofs in Section 4.1, we will restrict 
attention to generating functions (7 on M'^; these results may be extended to 
bounded domains the Besov semi-norms defined in terms of differences on 
bounded domains in Section 5.2.2 of Triebel (1992) may be used to extend 
these results. 

We now provide conditions for LARK models to be in the same Besov 
space as their generating functions. 

4.1. Convergence of LARK models in Besov spaces. 

Theorem 3. Fix g G B^q(M'^) for some p,q > 1 and s > and a Levy 
measure v on M x $7 with = (5^ x M'^) of translation-invariant product form 
i'(d/3 dio) = u{d(5 dA) dx [here lo = (A, x) ] for a a-finite measure i/{d/3 dA) on 
M X 5^ that satisfies the integrability condition (7). Define a location- scale 
LARK model /(•) on X = M.'^ by: f{x) = f^4>{x,uj)C{doj) where (/)(x,aj) = 
g{A{x — x)) satisfies ( 18) for each fixed x £ X . Then f has the almost surely 
convergent series expression 

(34) fix) = Y,9{^j{^-Xj))f3j 

j 

and f G B^^ almost surely if v satisfies 

(35a) [[ {lA\f3\\A\-^/P)D{df3dA)<oo, 

J JrxS^ 

(35b) // {lA\^\\A\'-^/P)D{d/3dA)<oo. 

J JrxS^ 

Proof. Equation (18) ensures that the sum in (34) will converge al- 
most surely for each fixed x £ X, with a finite number of terms \g{Aj{x — 
Xj))/3j| > 1 and infinitely many, but absolutely summable, terms with 
\g{Aj{x — Xj))Pj \ ^ 1- The Lp norm of / satisfies the bound 

j j 

by the triangle inequality and Proposition 2 in Appendix A. This is finite 
almost surely by (35a) since g G B^^ C Lp. The Besov semi- norm of / is 
bounded by 

\f\u<Y.\^MH^-x,))\;, 
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by Proposition 2; changing variables h^t = A/i, this is 

(36) =^i/3^.||A,r-vp( I itr'^-^iiArffii^diV^'. 

The integral in (36) is bounded by 

\t\-<i-'<i\\i^^g\\idt= [ \t\"'^~"i\\A^g\\ldt 



p 

t\<i 

+ / \t\-'^-"i\\A'^g\\ldt. 
J\t\>i 

The first term is just {\g\pq)'^, and (32) implies ||A™g||p < 2™||5f||p, so 

<(bi;,)'^+ / \t\-''-'H2^\\g\\pydt 

J\t\>l 



< (clbli;,)" 



IS )q>_ Z 1|„||9 



for some c < oo , so 

(37) l/i;,<c||<7li;,El/'^-|l^:'-|'"'^'' 

3 

which is almost surely finite by (35b). □ 

Each of the kernels g{-,oo) considered in the examples in Sections 7 and 8 
may be shown to be in some Besov space B^^, and each is bounded by 
Iblloo < 1- Corollary 3 establishes that each of our LARK models with a Levy 
measure that satisfies (7) is in the same space B^^ as its generating function. 

Corollary 3. Let f{x) = J (j){x,uj)C{duj) be a one- dimensional LARK 
model on a compact set X C M} , with product Levy measure ^{dfSduj) = 
ui3{d(3)'K\{dX) dx onWx M+ x X satisfying (7) with Gamma probability mea- 
sure 7rx{dX) = Ga{a\,bx) and location- scale generator (j){x,uj) = g{X{x — x)) 
with bounded g G B^^. Then f G B^^ almost surely if a\ > 1/p for p,^ > 1 
and s> 0. In particular, if a\>l then f G B*^ if 9 ^ B*^ for all p,q>l and 
s>0. 
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Proof. Equation (18) holds for bounded g € with Levy measures of 
the form indicated; the conditions on ax ensure that also \~^/^TT\{d\) < 
oo and J^^ X^~^/P7Tx{dX) < oo, so the bounds of (35) hold. □ 

4.2. Comparisons with Abramovich, Sapatinas and Silverman. The sto- 
chastic wavelet expansion of Abramovich, Sapatinas and Silverman (2000) 
may be viewed as a LARK model using wavelet generator (4e), with coeffi- 
cients that, when conditioned on the scale parameters {aj}, have indepen- 
dent Gaussian distributions No(0, ca~^) with to = (a, 6) G [ao,oo) x 
[0, 1) and ^^{duj) oc a~^l|„>„,j} dbda for some c, 5, ^ > 0, (5 + ^ > and oq > 1- 
The parameters 6 and control the size and frequency of wavelet coeffi- 
cients and determine whether the expansion will have a well-defined limit. 
For a finite Levy measure Vi^{doj) (^ > 1), the expansion will be in the cor- 
responding Besov space of the generating wavelet with probability one. For 
^ < 1, the Poisson mean is no longer finite; however, Abramovich, Sapatinas 
and Silverman (2000) provide conditions on 6 and ^ so that / falls in the 
corresponding Besov space of the generating wavelet. 

For "simplicity of exposition," Abramovich, Sapatinas and Silverman work 
with functions of unit period [i.e., satisfying g{x) = g[x + 1)] and regard 
them as functions on the unit torus T, the interval [0, 1] with the endpoints 
identified. We now illustrate how the LARK theory may be used to prove 
that the resulting expansion lies in ]Bpg(T) if the generating function does. 
The Besov sequence norms used by Abramovich, Sapatinas and Silverman 
and others are natural for the Gaussian distributions and discrete wavelet 
expansions they study; we have found the (equivalent) function norms to 
be more convenient for continuous wavelet expansions using non-Gaussian 
(a-Stable, e.g.) distributions used for the coefficients in our expansions. We 
follow Nikol'skii [(1975), Sections 1.1.1 and 4.3.5] in defining Besov norms 
on the torus by replacing the Lp norm on M with that over T in the defini- 
tion of the Besov semi-norm and norm [see (31), (33)], and in denoting the 
corresponding spaces by L*p{T) and IBp*(T), respectively. 

To simplify the proof, we will use the following lemma. 



Lemma 1. Let TTz{dz) denote the standard normal distribution on M, let 
g £ L*{T) with p>l and let r £ {0, 1}. Then 

(1 A \zg{uY\X~'')\~''TTz{dz) dX du <oo 

x[l,oo)xT 

for any a G M if b> 1, and for all a> 1 — b if b<l. 




The proof is given in Appendix A.l. 
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Theorem 4. Let g G ]Bp*(T) for some p,q>l and s > 0. Let C{duj) be 
a random field on 0,= [l,oo) x T with Levy measure 

(38) u{dp dX dx) = ^ A'5/2-Ce-/3^AV2 

V 27r 

on M X O with (5, C > 0. T/ien i/ie L^i^iiT model f{x) = X^/'^g{X{x - x))C{duj) 
has an absolutely convergent expansion 

(39) /(x) = ^/3jAf'5(A,(x-Xi)), 0<x<l, 

j 

provided that > 1 — ( for < C < 1; or for any 6 >0 if C, > 1. Also 

/(•) e ®pg(T) a/most siire/y /or ^ > s + 1 - C < C < 1 or for any 5>^ 
if C>1. 

Proof. The absolute convergence of (39) for each x will follow from 
Proposition 1 if we can verify the conditions of (18), that is, finiteness of the 
integral 

(40) /// ilA\l3X'/^g{Xix-x))\Md(3dXdx). 

J J JRx[l,oo)xT 

Applying the change of variables /3 i— >• 2; = A''/^/3, 

(41) = /// (lA|z|A(i-'5)/2|5(A(x-x))|)A-'^7r,(dz)dAdx, 

J J JRx[l,oo)xT 

where 7rz{dz) is the standard normal distribution. Since the term in paren- 
theses is bounded by one, (41) is finite for all 6 and g if C > 1. For < C < 1, 
apply another change of variables x 1— t- n = A(2; — %) and apply periodicity 

00 rXx 

/ (1 A \zg{u)\X^^-^y^) duX^^-'^ dXTT.idz) 

Jx{x-1) 

which, due to periodicity, satisfies the bound 

< / {I A\zg{u)\X^'^~^^/^)du\X]X~^"^ dXTT.idz) 
JrJi Jo 

< 2 /"/"/" {lA\zg{u)\X^^~^'^^^)duX-'^dXTr,{dz), 

J J JRx[l,oo)xT 

where [A] denotes the least integer > A. By Lemma 1 this is finite for < 
C < 1 if ^ > 1 - C with g G B^*, so (18) holds and Proposition 1 ensures 
convergence. 

The L* norms of the mth forward differences of a periodic function g{-) G 
Mp*{T) and their scaled translates g{X{- — x)) for x C T and positive scale 
A G [1, 00) are related by 

(42) ||A;r5(A(--x))li;<2^/^||A^,5li; 
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since, by a change of variables x i— )■ u = X{x — x), 

\\A^9{x{--xm; 



/■A(l-X) ™ 



1/p 



du 



which, again from periodicity, satisfies 
l/p ( A 



< 



du > 



IIAUsll 



while rA]/A<2. 

The Besov semi-norm of / is bounded by 

\f\;i<Y.\f^Mf\9i^,i--xj))\; 

3 



I s* 
\pq 



1/2 



h\<l 



\hr^-^^^^g{x,{--xm7'^h) 



< 



i/p 



|ft|<i| / 



(43) 



\ 1/9 



The integral in (43) is bounded by 

:\-'-^^ATg\\;ut = 



t\<i 



\t\-'-^^ATg^dt 



+ 



t\>i 



The first term is just (Iffl;;)", and (32) implies ||A5"c/||; < 2'"||5||;, so 

<(bi^:)^+ / \t\-'-''{2^\\g\\;ydt 



t\>l 

^1 



< ic\\gr;,r 
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for some c < oo, so 

(44) \f\;i<2c\\g\\;iY.\f^,\X;^'^' 

j 

is almost surely finite if and only if 




Sx[l,oo)xT 

is finite. Applying the change of variables /? i— )■ z = X^/'^P, 

(1 A \z\X'+'^^-^^/^)X-^Ti,[dz) dXdx 

x[l,oo)xT 

is finite by Lemma 1 for all 5 > if C > 1 and for ^ >s + l- CifO<C<l- 
A similar argument shows that the L* norm of / satisfies a bound of the 
form 



;<c\\g\\;Y,mx 



1/2 

'J 
j 



for some c < oo. This is finite almost surely if 

{lA\(3\X^/^)u{dl3dXdx) 

x[l,oo)xT 

{1 A\z\X^^-^^^^)X-^TTz{dz)dXdx 



5x[l,oo)xT 

is finite, which follows from Lemma 1 for all > if C > 1 and, if C < 1, for 5 
satisfying > 1 — ^ since g G B** C L*. Combining conditions, the B** 
norm of / is finite if 5/2 - 1/2 > s + 1 - C for < C < 1 and for all (5 > 
ifC>l. □ 

For Levy measures v[d(3 dXdx) supported on M x N x T (i.e., for which A 
is almost-surely integral) the function /(x) of (39) would inherit periodicity 
from the generator g{Xj{x — Xj)) but, for the absolutely-continuous measure 
of (38), it is the definition of f(x) as a function on T [as in Abramovich, 
Sapatinas and Silverman (2000), equation (2)] that induces periodicity. The 
restriction to A > 1 may be relaxed to the more natural A > in the LARK 
framework, but may require the use of compensation. 

4.3. Compensation. For Levy measures satisfying only the local-L2 bound 
of (9) and not the local-Li bound of (7), we must use the definition of f{x) 
in (14) and use (16) to establish conditions that ensure / will be well defined 
for g G Bp^. We verify these conditions for the existence of LARK models 
under symmetric a-Stable random fields. 
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Theorem 5. For a Symmetric a-Stable random field with Levy measure 
of the form u{d(3 duj) = Caa|/?|-^-" d(3TT{dA) dx onM. x S^xR"^ forO<a< 2, 
with 7r{dA) a probability measure on Sf and g G ]Bpg(M'^)nLi(R'^) forp, q>l 
and s > 0, the conditions of (16) for f[x) to be well defined by Theorem 1 
are satisfied for 1 < a < p, a <2 if E[|A|~^] < oo. For a = 1, there is the 
additional requirement that 



(45) / \g{u)log\g{u)\\du<oo. 

Proof. Fix x £ X. By the affine change of variables oix'~^ = ~ x)? 
{lA\P(l){x,io)\)u{dpdio) 

{lA^\g{u)\)r^~''d(3du 



2c„a / |ArV((iA) / / 
Js^ J J[i, 

2c„aE|A|-i [ { [ 



[l,oo)xl 



l9(«)r' 



1 



For 1 < a < 2, 



which is finite for 1 <a <p since g £ Li and g £ C Lp. For a = 1, 

= 2ciE|A|-i|/" -\g{u)\log\giu)\du+ [ \giu)\du]. 

The first integral exists and is finite by (45) while the second is finite since 
g G Li. Similarly, the integral in (16b) is 

{\P<j){x,co)\ A \f3^{x,Lo)\'^)u{df3dLo) 

'i,i]xn 



2co^aE\X\-^yj^^^ 



(|/35(u)| A|/3<7(u)|2)/3-i-°d/3dn|>. 



The integral in braces 



f3^~'^g{ufdf3du+ /3-''\g{u)\d(3 du 

[0,lA|g(n)|-l]xR<* J J [lA\g{u)\-'>- ,l]xR'l 

is finite for 1 < a <p, a <2: 

< r \9{u)r I \g{uT-\g{u)\ 

~ 2 - a a - 1 

II IIP 
\\9\\p 

- (2-a)(a-l) 
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while for a = l 



</ {\9{u)\ + \g{u)log\g{u)\\}du<oo 



by (45). Finally, (16c) holds because 



/ / {lA^^)\(l){x,u})\iy{dpdu) 



J JRxn 



= E\A\~^Caa [ [ {lAP^)\f3\-^~'^\g{u)\d(3du 



= E|A|-^||g||ic„a / (1 A/5=^)|/3r^^"d/5<oo. 



□ 



All of the generator functions in the examples in Section 7 satisfy the 
conditions of the theorem for the Cauchy random field (a = 1), so the LARK 
models are well defined as e — )■ and for finite e > 0, the approximations are 
in the same Besov space as g. We are able to show that this also holds for 
Sobolev W2 spaces (which are equivalent to B22) even when compensation 
is required, but this remains an open question for B*^ with general p and q. 

4.4. Convergence in W|. 

Theorem 6. Let {4){x,uj)} be a location- scale family of the form 4>{x,uj) = 
g{A{x — x)) for a; = (x. A) with x G and nonsingular dxd matrix A e 5^ 
for some function g{-) G W| with s > 0. Let u he a Levy measure satisfying 
the condition 



where p{A) denotes the spectral radius (largest eigenvalue) of A. Recall 



(46) 




(19) 





and, for £ > 0, define 




(47) 
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Hx,^^j)l3j- (j){x,u})h{/3)r]{df3duj). 



0<j<Je 

e<\l3j\ 

Then /£(•) — )• /(•) in almost surely as e— ?■ 0. 

Proof. First, consider the case of compensator functions satisfying 
/i(/3) = /3 for all |/3| < 1. Apply an affine change of variables to see that 
<j){x,uj) has Fourier transform (in x) 

For < ei < £2 < 1 and rr E M'', set A(x) = fe^ (x) - fe^ {x) and let A = {£i< 
|/5| < £2} X VL. Then 

A(x) = ^ (/){x,ujj)l3j - (t){x,uj)l3v{dl3duj) 

ei<|/3j|<e2 

is a zero-mean random function of x with Fourier transform 

0<j<J,^ 
ei<|/3j|<£2 

e'^-^\A\-^g{A-^^)l3i^{dpdu;), 
a zero-mean L2 random function of ^ with second moment 
(48) E|A(OP= / / \A\~^\g{A-'0\''l3MdPdu;). 



Thus A(-) has expected squared Sobolev norm E||/e^ — /; 



|2 



£2 I 



'2 



= (2^)-^ / / / {l + \^mAng{A-'Orf^MdPdu;)d( 

J J JR^xA 

= (27r)-^ /// il + \Aij\y\A\-^\giv)\^f3Mdl3du;)dv 

< (27r)-'^ /// il + \r^\y[{l + piA)f']\A\-'\g{rj)\'(3MdPdu)dv 

(49) = IIGIli,. / / [{l + p{A)f']\A\-'pMd^du;) 

-^0 as £i,£2 by (46), 

so {fsk} is a Cauchy sequence in for any £/c — and ||/ — fe^Ww^^ ~^ ^' 
Since fs is a finite linear combination of scaled translates oi g € W^, each 
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(and hence /) lies in W| almost surely and Theorem 6 is proved for com- 
pensator functions satisfying h{/3) = j3 for < 1. 

For an arbitrary bounded compensator /i(/3) satisfying |/3 — /i(/3)| < cfi'^ 
for some c > 0, (48) has the additional nonrandom term 



e*«-^^^^^(/3-/i(/3))Kd/3d^) 



leading at most to an additional constant factor of [l+c / /]g^f^(lA/3^) 1^(^/3 du))] 
in (49), leading as before to ||/ — /gfe ||w^ — ^ and completing the proof. □ 

Corollary 4. If {(j){x,uj)} is a location- scale family of the form con- 
sidered in Theorem 6 and if a Levy measure v is of 'product form v[d^diS) = 
vp{dl3)TT^{dLo) for some a-finite measure i'i3{df3) on R and probability mea- 
sure 7Ti^{-) on Q that for some s > satisfy 

(50a) / {1 A I3^)u,3{dl3) < oo, 

(50b) [ \A\"\il + p{A)f')7TUdu:)<oo, 

then u{dj3duj) also satisfies (46) and hence fs{-) — )■ /(•) in W| almost surely 
as e — )• 0. 



For example, in one dimension, (50b) is satisfied for all s > if A = A has 
the Xf distribution with u > I degrees of freedom, that is, if ~ Ga(a;A, P\) 
with ax> ^- More generally, for any m> (50b) is satisfied for all s > if 
A'" ~ Ga{ax,/3x) with ax > 1/m or, for m < 0, for oa > (1 — 2s)/m. 

Recall that the quantity e introduced in the proof of Theorem 6 and the 
statement of Corollary 4 is not a model parameter and has no bearing on 
the Sobolov spaces to which the limiting function /(•) belongs; it is only 
a tool used in proofs and implementations, to which we now turn. 

5. Inference for LARK models. The LARK model introduced in Sec- 
tion 1 may now be summarized as 

(51) E[Y{x)\C,e] = f{x)= [ (P{x,oj)C{doj), 

Jn 

£ I ^ ~ Levy(i/), 
e-^7re{de) 

with implicit dependence of the Levy measure i/((i/3 duj) and conditional 
distribution for Y(x) on a hyperparameter vector 6. In all of our examples, 
we take to be a product measure u{d(3dijj) = z/^(/3) (i/3|r2|7rj^(da;) satisfying 
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the conditions of Corollary 2, with 7ruj{-) a probability measure on |0| 
a measure of the volume of fl, and 1^/3 (•) > a nonnegative density function 
on M satisfying f^{l A d/3 < 00 [so u satisfies (9)], for which either 

(a) u also satisfies (7) or (b) 1^/3 (/3) is even and /i(/3) is odd in /3. Thus, we 
have the representation 

(52a) 9 ~ Trg{d9), 

(52b) ~ Po(z^+), i/+ = i/e(Mxrj), 
{{(^j,^j)}o<j<J I J,9'''^' TTfs{(3j)dl3j7r^{dujj), 

(52c) 

7r;3(/3) = l{|^|>,}Z./3(/3)|l^|/z.+ , 

I / '~ PY{y I f{xi))dy, 

(52d) 

/(^^i) = X] 4>{xi,ujj)l3j 

0<j<J 

for sampling model py(- | fJ-) parametrized by fi. 

5.1. Examples of Levy random fields. Motivated by the applications in 
Section 8, we now focus on LARK models built on approximations to Gamma, 
symmetric Gamma and Symmetric a-Stable (in particular, Cauchy) Levy 
random fields, and quantify the approximation errors to facilitate the selec- 
tion of e and other prior hyperparameters. 



5.1.1. Gamma LARK models. The Gamma random field of Section 2.5.2 
has i'is{d/3) = 7/3~^e~'^''l{^>o} for some constants 7 > and rj > 0. The 
parameter 77 in (22) controls both the Poisson rate of mass points {{f3j,ujj)} 
of magnitude |/3| > e and the probability distribution of those magnitu- 
des {f3j}. To facilitate elicitation we disentangle those two roles by trun- 
cating at |/3ry| > e (rather than |/3| > e); of course the limit as e — )• is the 
same. The distributions of J and {f3j} are now given by 

J ~ Po(z.+), z.+ =7|0|Ei(e), 

iid (3-~^e~^^^ 

Pj ' ~ ■ 7^/3 (^i ) , TT^ (/3j ) = ^-^^^^s^ 1 {/9i ' 

where the exponential integral function [Abramowitz and Stegun (1964), 
page 228] is denoted as ^i{z) = f^t~^e~^dt. With this truncation, the 
expected square L2 norm of the loss due to truncation for any (p G ^2(1^, 
\^l\TTi^{duj)), such as (j){uj) =4>{x,U}), is 

E\C[^]-CM\'= [ [ cl>{iofW\%m<s}Hd/3du:) 
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(53a) =\ml r'%^vp{P)dp 

Jo 

= lV-'Ml[l-il + e)e~% 

showing the rate at which Ce[(j)] — )• C[cj)] in L2 as e — )• 0. This is used in 
Section 5.2 to guide the ehcitation of hyperparameters. 

5.1.2. Symmetric Gamma LARK models. The symmetric Gamma ran- 
dom field of Section 2.5.3 has Levy measure ^^{df}) = ^\f3\~^ e~^^^'^ d(5 for 
some constants 7 > and > 0. Once again truncation at \[iri\ > e leads to 

J ~ Poi4), z.+ = 27|0|Ei(e) 

iid |/3-|~-'-e~'^j''' 
/3j '~ • TT^iPj) d/3j, 7r/3(/3j) = ^^^^^^^ — ^mv\>e} 

and expected squared discrepancy (used for elicitation) 
(53b) E\m - CM\' = '2lV~'ml[l - (1 + e)e-^]. 

5.1.3. Symmetric a-Stable LARK models. The SaS Levy random field 
of Section 2.5.4 has Ufs{dl3) = ^r(a) sin ^l/?!"""-*^ ci/3 for some constants 
7 > and < a < 2. To facilitate elicitation and posterior inference, we 
write 7 = 777"" and (again) truncate at \/3jr]\ > e. This leads to 

J ~ Po(i/+), z^+=7|0|-r(a)sin^e-'^ 

TT 2 

iid (y.S^ 
Pj '~ ■ T^pil^j) d(3j, 7r/3(/3j) = ■^\(3j\~°'~^i{\f}^r^\>e} 

with symmetric Pareto distributions for the coefficients {/3j}. For the Cauchy 
{a = 1), these simplify to i^^ = 2j\n\/{Tre), with 

Although the total variation |>C| is almost surely infinite, and even \C — Cs\ 
will be infinite for a > 1, still for </> E L2(il, \ i}\TTi_j{duj)) the expected squared 
discrepancy is finite: 

E\m-CM\^= [ [ H^f\/3Wi(Sr,\<e}HdPdu;) 
J Juxn 

(53c) 



T(a + 1) . na 2_„ 
sm — e 



7r(2-a) 2 
or 27r7~^||(/)||2[e/vr] for the Cauchy case a = l. 
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5.2. Prior elicitation of hyperparameters. We now turn to the selection 
of e > 0, the vector 9 £ Q of (52), and the Levy measure I'^dp duj). In each of 
our examples 9 = (7, rf) for rate parameters 7 and r] governing the frequency 
and magnitude of coefficients respectively, and the expected squared 

truncation error for Ce[(t)\ for </>(aj) = (j){x,uj) is of the form E|£[(/)(x, •)] — 
£e[(/)(x, = 7r/~^||(?l)(x, •)||2c(e) for some c(e) > with c(e) — > as e — ^ 
[see (53)]. 

We choose prior distributions to attain three goals: (1) desired range of 
number J of terms in the stochastic expansion; (2) desired range of coefficient 
magnitudes and (3) tolerable expected truncation error. We first select 
a Levy family (Gamma, a-Stable, etc.) to meet the needs of a particular 
problem for symmetry or positivity, sharp or heavy tails, etc. Each of our 
Levy measures is of the product form v{dl3 duj) = Vjj{dl3)ii^^{dLo) considered in 
Theorem 6 and Corollary 4, with location, scale, and perhaps other location- 
specific (and hence adaptive) attributes encoded in w G in problem-specific 
ways. 

Hyperparameters in the Levy measure vp{d(3) govern sparseness for LARK 
models, that is, the number J of terms in the stochastic expansion. In each 
LARK model, J has a Poisson distribution with mean proportional to 7. 
The coefficient of variation under the Poisson distribution falls to zero as 
the mean increases, overstating the prior certainty for large values of EJ. 
To ameliorate this, we introduce an additional layer of hierarchy by placing 
a Gamma prior distribution on the parameter 7 ~ Ga(a^, 6^), leading to the 
overdispersed negative binomial prior distribution for J~ NB(aj,pj). The 
parameter 77 governs the scale of the coefficients and hence the range 

of the regression function /(•). We employ a Gamma distribution for the 
scale parameter rj~^ ~ Ga(a^,6^). Together the hyperparameters e, a^, h^, 
Qrj, brj determine the prior distributions for J, for the coefficients {f3j} (and 
hence the range of /(•)), and for the expected mean-square truncation error. 
We select values for these five parameters to meet five criteria: attain two 
specified quantiles (such as a central 99% interval) for each of J and {(3j}, 
and a specified bound on the expected truncation error E'yT]~'^\\(p{x, •)ll2'^(^)- 
Typically this involves an iterative numerical solution. 

As a default choice, we take ^{duj) = TT^{dx)'^xidX) to be the product of 
the uniform distribution for locations x ~ Un(A:') and a Gamma distribu- 
tion for inverse (distance) scale parameters A~ Ga(aA,6A)- The shape and 
rate hyperparameters a\ and b\ govern the range of probable values for the 
location-specific inverse scale parameters {Xj} and hence for the smoothness 
of /(x), similar to how bandwidth selection governs smoothness in other ker- 
nel methods. A kernel at ujj = (Xj^^j) will represent a feature located at Xj 
of width 1 /Xj , so large values of Xj are needed to fit a very "spiky" part of 
a curve, while a smoother part of a curve may be fit most parsimoniously 
using small values of Xj. The prior distribution for Xj must support an ade- 
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quate range of values in order to fit a spatially inhomogeneous curve. Values 
of OA > 1 will ensure E[A] < oo and a finite covariance function; we choose 
{ax,bx) to attain two specified quantiles, such as a central 99% interval. 

5.3. Posterior inference. The joint posterior density of all parameters 
under the LARK model of (52), given observations Y = {1^}, is 

P{7,V,J,I3,^ I Y) 

exp[-i/e(M X Q)] 



(54) oc7r^(7)7r^(r/)- 



J! 



^ 4>{xi,UJj)/3j 
0<j<J ■' ^iel ^ 0<j<J 

The posterior (and full conditional) distributions of the parameters are not 
available in closed form. Since some of our parameters (/3 and u) have vary- 
ing dimension, some form of trans-dimensional Markov chain Monte Carlo, 
such as a reversible jump (RJ-MCMC) algorithm [Green (1995), Wolpert, 
Ickstadt and Hansen (2003), Sisson (2005)] must be used to provide sam- 
ples from (54) for posterior inference. See Appendix B for a sketch of the 
RJ-MCMC algorithm. 



6. Relation of LARK to other models. 



6.1. Gaussian processes or random fields. For any positive Borel mea- 
sure T,{duj) on a complete separable metric space il., there exists a Gaussian 
random measure Z{duj) on that assigns to disjoint Borel sets C of 
finite measure S(Aj) < oo independent mean-zero Gaussian random vari- 
ables Z{Ai) ~ No(0,S(Ai)) of variance EZ{Ai)'^ = T,{Ai). For any kernel 
function g on X xQ with (j){x, •) G L2{^, S((ia;)) for each x £ X, this induces 
a mean-zero Gaussian random field through the Wiener stochastic integral 

fix) = / (l){x,uj)Z{duj) 
Jn 

with covariance C{x,y) = E[f{x)f{y)] = J^<j){x,uj)(j){y,uj)Ti{duj). The Gaus- 
sian random measure Z{duj) is the special case of a Levy random mea- 
sure C{duj) defined earlier in (8) with 6{du}) = and v{dfidjjj) = 0. 

A wide variety of Gaussian processes are available in this form. For ex- 
ample, those with stationary covariance C{x,y) = c{x — y) may be written 
in the above form if the spectral measure has a density function c{uj) = 
J-y e~*'^'^c(x) dx whose square root is Lebesgue integrable, for example, the 
Matern class [Stein (1999), page 31] in M"^ with smoothness parameter 
u > d/2. The Gaussian random field model above may also be obtained 
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as the limit as a —t- 2 of the symmetric a-Stable LARK models considered 
herein, providing an alternative method for inference that avoids the need for 
large matrix inversions. To maintain a unified computational approach, we 
have limited our attention in this article to LARK models with pure-jump 
Levy random measures, that is, S(-) = 0. 

6.2. Compound Poissons and mixtures of Gaussian random fields. Mix- 
tures of Gaussian random fields may be constructed as LARK models with 
Levy measure of the form 

(55) u{d(3duj) = (27r(j2)-V2g-/32/2^S d^u^{duj) 

leading to mean functions of the form f{xi) = Yl,Q<j<j 4>{^-,'^j)f^j with nor- 
mally-distributed coefficients [ij\uj Ho^fi^^a"^). For finite measures v^, the 
expansion has a Poisson-distributed number of terms, hence, is a Poisson 
mixture of Gaussian processes (or for hierarchical models with a Gamma 
distributed Poisson mean, a negative binomial mixture of Gaussian pro- 
cesses). In Section 4.2, we showed that the stochastic wavelet expansion 
of Abramovich, Sapatinas and Silverman (2000), an example of (55), may 
be viewed as a LARK model. Chu, Clyde and Liang (2009) extend the 
compound Poisson (or LARK with finite u) model to include mixtures of 
normals distributions for /J^^ and develop methods for Bayesian inference 
for such OverComplete Wavelet expansions (OCW); we compare the OCW 
method to other LARK models in the simulation study of Section 7. 

For automatic curve fitting using splines and wavelets, Denison et al. 
[(2002), Chapter 3] used a similar hierarchical model with common = a, 
but truncated the (Poisson-distributed) number of terms in the basis expan- 
sions at some fixed upper bound Ju- Taking J.^ — )• oo leads to the Gaussian 
LARK model of (55) with a common variance. Gaussian processes have 
sharp tails, of course, leading to concerns about robustness when they are 
used as prior distributions in problems with likelihood functions that fall 
off more slowly. Specifying variances for Gaussian prior distributions is non- 
trivial, with large "noninformative" choices leading to the so-called Lindley 
paradox. Denison et al. recommend an inverse Gamma prior on cx^ to avoid 
this well-known problem. This leads to a multivariate Student t distribution 
on the expansion coefficients and, since the prior now has bounded influ- 
ence, provides robustness. The limiting model (as Ju — )• oo) may be viewed 
as a mixture of Levy random fields. 

Rather than using a multivariate Student t for the coefficients, one might 
use "ridge" priors and model the uncertain function /(•) = ^o<j< j 
as the sum of a Poisson (or negative binomial)-distributed number J of ker- 
nel functions (j){-;ujj) with coefficients /3j C(0,r) drawn from a centered 
Cauchy distributions with scale r. To accommodate rough functions /(•), 
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one must be willing to consider large numbers of terms, most of which will 
have small coefficients — under these priors, one must consider large EJ and 
small T. But how small? And what happens if r is made a bit smaller and 
EJ a bit larger? As r — t- 0, if one scales the expected number EJ of terms 
(as a function of r) properly, this model converges to a LARK model with 
infinite Levy measure (and so is not sensitive to the cut-off e, which merely 
quantifies how close is this approximation). If EJ is not scaled properly to 
converge to a LARK model, the limiting results may depend critically on 
arbitrary and unintentional choices. 

This may be implemented explicitly in LARK form by placing indepen- 
dent Ga(a/2,e/2) prior distributions on a~'^ in (55) to achieve independent 
univariate Student tQ,(0,e) distributions for the coefficients {/3j} and (ap- 
proximately, as the parameter e — t- 0) the heavy-tailed Symmetric a-Stable 
process for /(x) of Sections 2.5.4 and 5.1.3 [this also illustrates that truncat- 
ing the support of /S^j is not the only way to construct suitable approximat- 
ing sequences of finite Levy measures i^sidf^duj) u{dl3duj) for which the 
integrals in (27) converge]. An important feature of our infinitely divisible 
construction (in contrast to a compound Poisson approach from other dis- 
tributional families) is that in each e — )• the approximating model 
converges to one with a well-defined prior (with infinite Levy measure) and 
a proper posterior distribution. 

6.3. Finite dimensional frames. LARK may be viewed as a limit of 
Bayesian variable selection methods with finite frames or dictionaries. Wolfe, 
Godsill and Ng (2004) consider frames based on discretizing O as a fine 
grid with |G| elements. They place i.i.d. prior distributions 'KG{P)d(3 on the 
nonzero coefficients and i.i.d. Bernoulli kernel inclusion indicators with in- 
clusion probability pc- If \G\pgt^g{P) — ^ ^{(^) as |G| — )■ oo, then the result 
converges to a LARK model on the infinite-dimensional frame. The rep- 
resentation in Wolfe, Godsill and Ng (2004) uses a point mass at zero to 
provide sparsity. Similarly, one may view the prior distributions in LARK 
under the e-truncation approach as assigning zero mass to a neighborhood 
around zero, also leading to sparse representations. One benefit of LARK is 
its provision of a formal method for coherent prior specification for contin- 
uous dictionaries; a second is its provision of a proper prior specification in 
the limit as e — )• 0, ensuring insensitivity to the choice of e. 

Standard stochastic search algorithms using finite-dimensional frames may 
exhibit poor mixing when the correlations between grid elements tend to ±1. 
To illustrate, suppose that two possible kernel parameters ojq and uji are close 
in parameter space, leading to two highly correlated columns in the design 
matrix. In addition, assume that inclusion of either column leads to nearly- 
maximal likelihood. With the standard one-at-a-time deletion or addition 
moves in many stochastic search algorithms, to move from a model includ- 
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ing a kernel indexed by loq to one indexed by ui would require an extremely 
unlikely deletion followed by an addition (or unlikely addition followed by 
a deletion). LARK avoids this difficulty by allowing the continuous parame- 
ter u indexing dictionary elements to move incrementally from loq to cji by 
a series of update steps, avoiding some of the poor mixing problems associ- 
ated with highly correlated frame elements in a fine-grid based method. 

6.4. Dirichlet processes. The Dirichlet process [Ferguson (1973, 1974), 
Antoniak (1974)] has received widespread use as a prior distribution on 
probability distribution functions. Its popularity is due in large part to its 
analytic tractability in many problems; simulation is straightforward, and 
Bayesian MCMC inference methods are available [Escobar (1994), MacEach- 
ern (1994), Escobar and West (1995), MacEachern (1998), Miiller and Quin- 
tana (2004)]. Liang, Mukherjee and West (2007) consider nonlinear regres- 
sion and classification models E[Yi \ Xi] = f{Xi) for data {(yi,Xj)} using 
kernel expansions of the form 



with random signed measure ^{du) expressed as the integral of a weight func- 
tion w{u) with respect to a probability distribution F, modeled as a Dirichlet 
process F~ DP(Fo,a) with base measure Fq and scale a > 0. If observed 
points {Xi} are viewed as a random sample from F, then updating the 
posterior for F solely on the basis of the observed {Xi\ would lead in the 
limit as a —7- to a degenerate posterior for F concentrated at the empirical 
distribution for X, justifying the finite-dimensional expansion 



with kernels evaluated only at the observed data locations. The generalized 
g'-prior of West (2003) for the coefficients {wi = w{xi)} leads to dependent 
Cauchy distributions for the {/(xj)}. This approach (like the SVM, RVM 
and related approaches) has as many coefficients as there are data points, 
but avoids over-fitting through shrinkage. Asymptotic properties of f{x) as 
n — )• oo are difficult to study in the absence of a limiting structure such as 
that provided by LARK. 

The Dirichlet measure F{du) does not assign independent random vari- 
ables to disjoint sets and so (56) is not a LARK model, but it can be con- 
structed from one. In fact it is exactly the normalized LARK model 



(56) 




n 




1=1 



(57) 
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with F{du) = C{du) / C{Q) for a Gamma random field C{du) with infinite 
Levy measure 

v{dl3du) = a/3"^e"^l{^>o} dpFo{du), 

where := ^(ij [note that w{u) could be absorbed into 

Well-known disadvantages of Dirichlet process models include their inflex- 
ibility (the single parameter a determines the prior dispersion everywhere, 
precluding prior speciflcations with more uncertainty in some regions than 
in others), their discreteness, and the limited variability of the masses as- 
signed to the countably-many support points. The normalized Gamma rep- 
resentation (57) of DP's offers the opportunity to overcome some of these 
disadvantages — for example, the Gamma process may be given a variable 
rate parameter h{u) by taking 

v{dl3du) = r^e-^^'''^^l{pyQ}dl3Fo{du) 

leading to a precision that can vary with location n € $7, or the Gamma 
random field may be replaced with another nonnegative Levy random field 
with wider dispersion, such as the fully-skewed Stable process of index a < 1. 
Other nonnegative Levy random fields are beginning to be used in machine 
learning [Jordan (2010)] and other fields. 

7. Simulation study. We now turn our attention to simulated and real 
examples to illustrate the performance of LARK models in practice. We con- 
ducted a simulation study using four spatially varying functions introduced 
by Donoho and Johnstone (1994) that are now standard in the wavelet liter- 
ature: Blocks, Bumps, Doppler and Heavysine. Data were generated for each 
test function by adding independent Gaussian random noise No(0, CT^) to the 
true target function /(•) at n = 1024 equally-spaced points on X = [0, 10]. As 
in Abramovich, Sapatinas and Silverman (1998), the value of a was chosen to 

attain a root signal-to-noise ratio (RSNR) of f^{f{x) — f)'^dx/a'^ = 7.0, 
where f = fx /(^) Each target function /(•) has a range of approx- 
imately < f{x) < 25. For each function, we generated 100 replicate data 
sets to evaluate the performance of LARK and other methods on the basis 
of mean squared error 

n 

(58) MSE^n-'Y.{f{^)-fixi)f. 

1=1 

7.1. Hyperparameters. In Table 1, we report the kernel functions used 
for the four simulation examples, chosen to illustrate the flexibility of LARK 
to use a wide range of kernels that may be adapted to anticipated features 
(smoothness, spikiness, jumps, curvature, covariation, etc.) of applications. 
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Table 1 

Kernel functions used for four test functions 



Test function 


Kernel ct){xi; Xj , >^j) 


Blocks 


l{0<Aj(3:,-Xj)<l} 


Bumps 


g-Aj ]xi-Xj \ 


Doppler 




Heavysine 


e l{|x.-Xjf<2.0} 



Table 2 

Hyperparameters used in examples of Section 7. 1 



Levy measure 


e 










a\ 




Symmetric Gamma 
Cauchy 


0.0041 
0.0029 


2.53 
2.53 


6.45 
14.2 


13.01 
0.50 


0.71 
1.00 


1.117 
1.117 


0.1965 
0.1965 



In each case, we take O = [0, 10] x (and |Q| = 10), with elements denoted 
oj = (x, A), comprising a location parameter x S A' = [0, 10] and a shape pa- 
rameter A > 0. As described in Section 5.2, we take {xj} Un(r2) and 

{Aj} Ga(oA,6A) with a^, chosen (see Table 2) to achieve a 95% prior 
interval of [0.20, 20.0] for A to attain dilated kernels covering from half a per- 
cent up to fifty percent of X . 

Our choice of the remaining hyperparameters was guided by three objec- 
tives: to achieve a 95% prior predictive interval of [5,100] for J, to achieve 
a 95% prior predictive interval of [—25,25] for the and to achieve 

a limit on the mean squared truncation error of ||-C[</>] —££[</>] || 2 = (E|vC[(^] — 
^e[4>\?Y''^ < 0.05 • ||(/>||2 (see Section 5.2). While these objectives could be 
met for the LARK model with symmetric Gamma prior with the values 
given in Table 2, they are not quite attainable for the Cauchy model — the 
competing goals of an extremely wide distribution for the {/3j} and a low 
mean squared truncation error cannot be reconciled. Upon relaxing the prior 
predictive distribution requirement on {/3j} to a 99.9% interval of [—33,33], 
adequate for this problem with a flat Pareto-tailed distribution for {/3i}, the 
remaining objectives for the distribution of J and the mean square trun- 
cation error were attained using the values given in Table 2. See Figure 5, 
Appendix C for realizations from the prior distribution. 

7.1.1. Performance. We compared LARK with two of the best wavelet me- 
thods currently available for inhomogeneous function estimation using over- 
complete representations: the empirical Bayes approach ("EBayesThresh") 
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Table 3 

Average and (standard errors) over 100 replications of mean square errors of the four 
test functions using the Levy Adaptive Regression Kernels (LARK) using the symmetric 
Gamma and Cauchy priors, the OCW approach using a Laplace prior [Chu, Clyde and 
Liang (2009)], and the EBayesThresh approach using a Laplace prior [Johnstone and 

Silverman (2005a)] 



Method 


Blocks 


Bumps 


HeavySine 


Doppler 


LARK- Gamma 
LARK-Cauchy 
OCW 

EBayesThresh 


0.030 (0.0013) 
0.026 (0.0011) 
0.060 (0.0023) 
0.096 (0.0013) 


0.111 (0.0019) 
0.105 (0.0017) 
0.285 (0.0025) 
0.307 (0.0032) 


0.038 (0.0010) 
0.036 (0.0010) 
0.082 (0.0010) 
0.118 (0.00098) 


0.152 (0.0030) 
0.157 (0.0028) 
0.152 (0.0019) 
0.202 (0.0027) 



of Johnstone and Silverman (2004, 2005a, 2005b) using translational-invariant 
wavelets, and the continuous over-complete wavelet ("OCW") approach of 
Chu, Clyde and Liang (2009) based on the stochastic wavelet expansions of 
Abramovich, Sapatinas and Silverman (2000). We replicated the results of 
Johnstone and Silverman (2005b) under the beta-Laplace prior using their R 
package EBayesThresh [Johnstone and Silverman (2005a)] with Daubechies' 
"least asymmetric" (la8) wavelets [see Section 4 of Daubechies (1988) or 
Section 6.4 of Daubechies (1992)]. OCW uses the same la8 wavelet as 
EBayesThresh except for the Blocks example, where both LARK and OCW 
use the Haar wavelet. The OCW method may be viewed as a special case of 
LARK with a finite nonseparable Levy measure, where coefficients f3j have 
independent Laplace distributions conditional on scale parameters \j , which 
in turn have truncated Pareto distributions. As in LARK, OCW assigns in- 
dependent uniform locations, with a negative binomial distribution for the 
number of terms in the expansion. 

The performance of each method was measured by its average mean 
square error (AMSE), defined as the average value of the MSE given in (58) 
over the 100 replicated simulations. Overall, the performance of the LARK 
model is excellent (Table 3). Both LARK versions generated lower AMSE 
values than did EBayesThresh for all four test functions. LARK also has 
smaller AMSE than OCW, except for Doppler, where the methods are com- 
parable. For Blocks, both LARK and OCW use the Haar wavelet, thus any 
difference in results is due to the prior distribution on the function; LARK 
leads to a 50% reduction in AMSE compared to OCW. For the other exam- 
ples, both OCW and EBayesThresh uses a Laplace prior distribution for each 
coefficient in the expansion and the same wavelet; in all cases it is clear that 
using a continuous dictionary is better than the finite-dimensional dictio- 
nary (frame) with the nondecimated wavelets. Lark reconstructions (right 
column. Figure 1) consistently show less ringing and fewer artifacts than 
EBayesThresh (left column). 
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02468 10 02468 10 




02468 10 02468 10 



Fig. 1. Comparison of fitted functions using EBayesThresh beta, laplace [John- 
stone and Silverman (2005a)] (left column) and Levy Adaptive Regression Kernels 
(LARK-Gamma) (right column) for the four test functions. From top to bottom, the test 
functions are Blocks, Bumps, Doppler and Heavysine, respectively. 
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Fig. 2. Left: results of the LARK model for the motorcycle crash data. Circles represent 
the observations; solid line is the posterior mean; dotted lines are pointwise 90% Bayesian 
credible interval for the mean function. Right: histogram of posterior samples of the expo- 
nential power parameter p, with prior density (solid line) for comparison. 



8. Applications. 

8.1. Motorcycle crash data. To further illustrate the method, we explore 
the motorcycle crash experiment data of Schmidt, Mattern and Schiiler 
(1981) considered by Silverman (1985), shown in Figure 2. The 133 observa- 
tions are unequally spaced, with repeated observations at some time points. 
Our focus in this example is to illustrate how a single wide class of generating 
functions may be used in LARK, with the data (through the likelihood) influ- 
encing the choice of kernels present in the posterior distribution. We use the 
power exponential family of kernel functions 4>{x; x, A, p) = exp{— A|x — xl''}; 
but here (in contrast with the examples in Section 7) we treat p as an un- 
certain parameter and make inference about it from the data. We take the 
power p to be common for all kernels, and use a relatively concentrated 
Gamma prior distribution p ~ Ga(2. 0,0.75) with a 50% HPD interval of 
[0.58, 2.56] which comfortably includes both the Laplace {p = 1) and Gaus- 
sian (p = 2) kernels as special cases. 

The results are summarized in Figure 2. It is apparent that the fitted mean 
captures the general trend of the data very well, with minimal boundary 
effects. The model is parsimonious in the sense we only need 4 kernels on 
average to fit the data. The posterior mean for p is approximately 3 with 
most of the posterior mass well above the values (p = 1,2) for the Laplace 
and Gaussian kernels. 

8.2. Spatial temporal model. In this section, we explore the performance 
of the LARK approach for modeling hourly SO2 concentration levels (mea- 
sured in ppm) in Pennsylvania, New Jersey, Delaware and Maryland [U.S. 
EPA (2007)]. The locations of the 33 monitoring stations are shown in Fig- 
ure 3; the study region 5, delineated by a rectangle in the figure, covers 
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Fig. 3. (a) Thirty-three monitors used by EPA to measure hourly SO2 concentration in 
year 2002. The inverted triangle denotes Site 31. The study area is delineated by a rectangle 
that includes parts of Pennsylvania, Maryland, New Jersey and Delaware, and is blown 
up in (b) which illustrates locations of kernels from a draw from the posterior distribution. 
The blue circles represent aperiodic points and the red circles represent daily periodic point 
sources. Circle areas are proportional to the magnitudes of the point sources they represent. 



a 310 km x 310 km area. We used rescaled coordinates from a Lambert 
(conformal conic) projection to reduce the distortion caused by the earth's 
curvature. For demonstration purposes, we restrict analysis to measurements 
taken during a 144 hour period T from September of 2002. About 5% of SO2 
readings are missing (at random) from the data set, which is not a prob- 
lem for the LARK model. While Gaussian random field models are popular 
for modeling spatial-temporal data, the log transformation typically used 
in the Gaussian approach (because the mean function is strictly positive) 
eliminates many of the (important) spiky features of the data. Our Gamma 
random field prior distribution allows us to model the data in the original 
units. 

The model can be written in the same simple form as (52), but now the 
SO2 concentration Y{x) is indexed by points x£X = SxTin space-time 
and the Levy random measure C{duj) assigns Gamma-distributed random 
variables to Borel sets of a space of points uj = (cr, r, A,A) that include 
a location (cr, r) G 5 x 7~ in space-time, a positive-definite 2x2 spatial 
dispersion matrix A G 5^, and a temporal decay rate A > 0. We employ 
a separable kernel of the form 

(l){x,uj) = exp{-(s - cr)'A(s - cr)/2 - X\t - r|} 

and in the spirit of Higdon [(1998), Section 3.2] and Higdon, Swall and Kern 
[(1999), Section 2.2], we employ a novel parametrization for A in terms of 
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its eigenvalues and the orientation of its major axis [see Tu (2006), Sec- 
tion 4.2.6, for details on prior specifications]. In variations also described in 
Tu [(2006), Chapter 4] accommodation is made for partial periodicity (due 
to diurnal patterns associated with daily variation in ambient temperature, 
traffic levels, etc.), still within the framework described by (51) but now 
with more elaborate choices for 0, and (j){x,uj). 

The locations of latent point sources from one iteration of the RJ-MCMC 
algorithm are presented in Figure 3(b). Larger latent points appear to be 
clustered in the Baltimore metropolitan area and near the New Jersey/Penn- 
sylvania border. The model's support points are more than a mere modeling 
device — they can help analysts identify possible underlying sources of pol- 
lution, or support future decisions on monitor locations. 

The predictive power of the model is validated through out-of-sample 
prediction. The model was fit excluding data from Site 31 [the inverted tri- 
angle in Figure 3(a)], and then its predictions were compared with reported 
measurements from that site for the entire 144 hours. The result shown 
in Figure 4 is promising. The major peak was captured clearly, and 90% 
pointwise Bayesian credible intervals cover in excess of 80% of the true ob- 
servations. This was a challenging out-of-sample prediction problem due to 
low cross-correlations among sites. We are currently refining features of the 
prior distributions to incorporate known point sources. 



leave-one-out prediction 




hours 

Fig. 4. Out-of-sample predictions for Site 31. Dashed line represents observed time se- 
ries, solid line represents predictive mean curve. Gray lines are 90% posterior predictive 
intervals. 
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9. Discussion. In this article, we have developed a fully Bayesian adap- 
tive kernel method, LARK, for nonparametric function estimation. The 
LARK model is based on a stochastic expansion of functions in a continuous 
overcomplete dictionary, and may be expressed as a stochastic integral of 
a kernel or other generating function with respect to a Levy random field. 
When (7) is satisfied (so compensation is unnecessary), the Levy field is 
a random signed measure. By using a positive random measure and pos- 
itive kernel family, LARK models provide natural constructions for non- 
negative functions (as in Section 8.2); with signed measures, unconstrained 
functions may be modeled (as in Sections 7 and 8.1). The kernel parame- 
ters are location-specific and thus adapt to local features of the data. As 
with wavelets, the adaptive smoothing using LARK preserves local features 
such as discontinuities and high peaks and is especially useful for modeling 
inhomogeneous functions. The LARK approach does not require that the 
data be equally-spaced without missing observations nor that the sample 
size be a dyadic power as is a commonly required of many wavelet meth- 
ods. 

The RJ-MCMC algorithm developed for fitting LARK provides an 
automatic stochastic search mechanism for finding sparse representations 
of a function. The algorithm is computationally efficient [requiring only 
0{n ■ M) operations for data including n observations and an MCMC stream 
of length M], as dictionary elements are calculated only when needed. Ker- 
nel methods such as Support Vector Machines (SVMs) and Bayesian Rele- 
vance Vector Machines [or RVMs, Tipping (2001)] employ all data points as 
kernel locations, but attain sparsity by shrinking coefficients to zero. LARK 
provides additional fiexibility by not restricting kernel locations. Many com- 
peting sparse methods, including the Dantzig Selector and Lasso, require the 
a priori selection of a pre-specified number of dictionary elements. Evalu- 
ating these kernels on a sufficiently fine grid will exceed the computational 
cost of LARK. Fine grids also lead to extreme multicollinearity in these ap- 
proaches, that may lead both to numerical instability and violation of the 
conditions needed for sparse solutions. 

9.1. Extensions. It is straightforward to implement LARK with wide 
classes of generating functions including wavelets, structural elements in tex- 
ture analysis, and splines. Unlike support vector machines or other methods 
based on Mercer kernels [Pillai et al. (2007)], the LARK approach does not 
require symmetry, continuity or simple functional forms. While it is often 
convenient to use kernels based on some distance metric, arbitrary generat- 
ing functions may be tailored to the problem at hand as illustrated in the 
space-time example of Section 8.2. The LARK modeling approach adapts 
readily to problems in any number of dimensions. 
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In Section 4, we present conditions for LARK models to belong to the 
same Besov space as their generating functions, for Levy measures and gen- 
erating functions that satisfy the stringent local Li-bound of (18). In the 
more general case, where (18) fails and compensation is required, we are able 
to establish similar results only for B^g with p = q = 2 (equivalent to Wl). 
We are exploring extensions to the general case, but the additional drift 
term that arises in compensation complicates confirming the convergence 
of to / in for general p, q. 

Work is also on-going in establishing conditions for posterior consistency 
for function estimation. Extending methods of Choudhuri, Ghosal and Roy 
(2004), Ghosal and van der Vaart (2007) and Choi and Schervish (2007), 
Pillai (2008) has verified posterior consistency for certain LARK models 
with Gaussian measurement errors in work that will be reported elsewhere. 

APPENDIX A: DETAILS OF PROOFS 

Proposition 2. For a function g{-) G Lp{R'^) and its scaled translate 
g{A{- — x)) with x G R'^ and positive definite matrix A € 5^, the Lp norm of 
g{A{- — x)) o,nd the Lp norm of its mth forward differences are given by 

(59) MM- - xMp = \A\'/ng\\p\\A^g{A{. - xMp = |A| Vp||A^,5|Ip, 
where \ A\ denotes the determinant of A. 

Proof. By a change of variables x'~^ u = A{x — x), 
I Air</(A(. - x))ll = -I / |Ar9(A(x - x))l'dx ■ 



E 

fc=0 



m 



{-ir-^g{A{x + kh-x)) 



l/p 



dx 



Iai-vp / 

l"^ fc=0 

\A\-'I^At^g\\p. 



E 



m 



{-l)^~^g{u + kAh) 



l/p 



du 



The proof for the Lp norm of g{A[- — x)) follows by the same change of 
variables. □ 



A.l. Proof of Lemma 1. First, consider the case 6 > 1 and a G M. Then 

(1 A \zg{uY\\-°-)\-^n^{dz) dXdu 

x[l,oo)xT 
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< 




[l,oo)xT 



A Tiz{dz)d\du 



1 



< oo. 



h-l 

Next, consider the case of 6 < 1 and a > 1 — 6 (which imply a > 0): 

{I ^\zg{uy\\~'')\~\,{dz)d\du 

d\TTz{dz) du 

~a~b 




Rx[l,oo)xT 



+ 



\zg{uY\>lJl 

\zgiuy 



RxT 



zg{uy\>l 



A 



1-6 



l-b 



lV|z3(n)''|i/'' 
A=|zg{n)''|i/° 



A"""''dA7r^(dz)dn 



-Kzidz) du 



+ 



A=l 



\^9{u) ,^ , 
MxT i - a - A=lV|2g(u)'-|i/" 

1-1^5(^)^(1-'')/'^ 



TTz{dz) du 



zg{uY\>l 



6-1 



+ 



+ 



// 

J J\zg(u 

II 

J J\zn(u 



IzgiuYl'^^-''^' 



zg(uy\>i a + b-1 
\zg{u 



-Tiz{dz) du 



■■Kz{dz) du 



zg{ur\<l 



a + b-1 



TTz{dz) du 



< 



1 



6-1 
1 



+ 



+ 



IzgiuYl^^-^^/"- — ^- TTTTzidz) du 

Mxt' ^ ' ' (a + 6-l)(l-6) ' 



z\i^-b)/a^^^dz) / IgiuYl^^-^^/^du 



6-1 (a + 6-l)(l-6) 

< oo 



for a + 6>l ifr = 0, and for ap + 6>l ifr = l [since g € L*(T)], which is 
imphed by a > 1 — 6. 

Now consider the case of 6 = 1 and a > 0: 



]Rx[l,oo)xT 



{I A\zg{uY\X-'')X-''Trzidz) dXdu 
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\zg{uY\>lJl 



A ^d\TTz{dz)du 



oo 



+ / / \zg{uy\ / X-"-^ dXTT,{dz)du 



log A 

\zg(uY\>l 



xT 



\=\zg{uY\^/'^ 

TTzidz) du 

A=l 

^-a A=oo 

TTz{dz) du 

X=lV\zg{uY\'^/^ 



= [ [ —log\zg{uY\Trz{dz)du+ f f -T^z{dz)du 

J J\zg{uY\>l a J J\zg{uY\>l " 

+ // \^^^z{dz)du 
J J\zg{uY\<l " 

< - I I logj^\zg{uY\7rz{dz)du+- 

< OO 

since log_^_{zg^) = (0 V log Iz^*^']) < \z\ + \g\'^ and g G L^(T). 

A. 2. Proof of Theorem 2. For any compensator function h{f3) satisfy- 
ing (10) there are numbers Cj G (0, oo) such that 

\h{P)\<co, |/?-/i(/3)|<ci(|/3|A/32), |/i(/3)|<C2(lA 



for ah /? G M. Fix < e < 1 and a function (/):M x O M satisfying (16); 
let Ba, Bb and Be be the values of the integrals from (16a)-(16c), respec- 
tively. To complete the proof of Theorem 2 it suffices to show that each of 
the two terms from (27), 

X = [ [ {P-h{(3))^{uj)Mid/3du}) and 

(60) 

Y = [ [ h{l3)(l){u})M{dl3doj), 

converges to zero in probability as e — )• 0. Write the first integral in (60) as 
the sum of two parts: 



X = [ [ iP-hiP))^{co)M{dpdoj) = Xi+X2 
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with 



Xi= iP-h{f3))(P{u;)Midf3duj), 
J J N^n[\f5<p\<i] 

X2 = [ [ i/3-h{l3))(l){u;)Mid(3duj). 
J J N,n[\is<j,\>i] 



Then 



E|Xi|<ci// i\/3\Af3^)\<l){coMdpdu;) 
J J N^n[\i3<i)\<i] 

= ci[[ {lA(3^muj)\u{dpdoj) 
J ^Af£n[|/3|<i]n[|/3<^|<i] 



+ ci / / (1 A|/3</.H|)i/((i/3(ia;) 

J J7Ven[|/3|>i]n[|/3<^|<i] 

< Ci{Bc + Ba) <00, 

SO Xi — )• in Li as e — )• by Lebesgue's dominated convergence theorem 
since the indicator function l|jVg}(/3, tends to zero a.e. (v) as e — )• 0. Now 
consider X2: 

z.({(/3,a;):|Ma;)|>l})= / / Mdf3 dco) 



+ / / lv{dpduj) 

< 1 1 mi^)\^\m^)\^Md/3dio) 

]/3|<i]n[|/3<ii|>i] 



+ / / {lA\p(t>{uj)\)u{d(3dco) 
J Jm>i]n[\i3ct,\>i] 

< Bb + Ba<oo, 

so almost surely the random support of M{dj3duj) in [\f3(j)\ > 1] is a finite set 
disjoint from fl^^^Q A^^; it follows that J\f{Ne Pi [|/3(/)(a;)| > 1]) and hence 
X2 — )■ almost surely as e — )• 0. 

Similarly, we write the second integral in (60) as the sum of four parts: 



Y= I h{f3)(P{io)M{dl3 dio) = Yi + Y2 + Y3 + Y4 



with 



Yi= I h{l3)(t){Lo)N{d(3duj), 

' N,n\\p\<i]n\m<i] 
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Y2 = [ [ h{/3)(j){u)M{df3du;), 
J iAr^n[|/3|<i]n[|/3</.|>i] 

Ys= f f h{^)<P{Lo)M{dl3duj), 

J J N,n[\i3\>i]n[\i3,j)\<i] 

Yi= f f h{p)(j){uj)M{d/3duj). 

J JN,n[\^\>i]n[\^4>\>i] 

Now 

|2 / / u^a\2i/, .\2 



E\Yi\^= / / h{PY(t){ojYv{dpdoj) 

J J N^C\[\l3\<l]n[\l3<j>\<l] 
<4 [ [ \^(l){u})\'^u{dpdoj) 

= 4 [ [ (|/30(w)| A \P(P{oj)\^)u{dpduj) 

J JAf£n[|/3|<i]n[|/3(/-|<i] 

< c^Bb < oo, 

so Yi — )• in L2 (and hence also in Li) as e — )■ by LDCT, 
Y2 = [ [ h{p)(l){u)M{df3duj) 

h{p)4>{uj)J^{dl3duj) 

N,n[\p\<i]n[\P4>\>i] 

h{l3)(p{uj)v{d/3duj), 

N,nm<i]nl\P4>\>i] 

E|l2|<2 / / \hm\H^Md/3du;) 
J J N,n[\i3\<i]n[\i3<t>\>i] 

<2c2 [ [ \P^{uj)\i^{dl3 du) 

= 2c2 [ [ {\m^)\^\m^)\''Hd^dco) 

J J N,n[\i3\<i]n[\i3^\>i] 

< 2C2B1, < 00, 
so 12 — >• in Li as e — )• by dominated convergence, 



^3 = / / h{(3)<j){uj)N{dl3duj) 
'7V,n[|/31>iln[|/3<^l<i] 



h{(3)(l){uj)N{dl3duj) 
Af,n[|/31>i]n[|/3<^l<i] 
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h{l3)(l){uj)u{df3duj), 

Ar,n[|/3|>i]n[|/3(/.|<i] 

E|y3l<2 / / ihimH^Mdl^dij) 

J J Nen[\i3\>i]n[\i34>\<i] 



<2c2 / / \P(t>{uj)\u{dpduj) 



2c2 / / {lA\f3(l){oj)\Mdl3duj) 

I N,n[\i3\>i]n[\i3cp\<>i] 



< 2C2-Ba < OO, 

SO 13 — >■ in Li as e — )• 0. Finally, for I4, 



¥4= I h{(3)(l){uj)N{dl5duj) 
/Af,n[|/3|>i]n[|/3<^|>i] 



h{P)(t){uj)M{d^duj) 
7V,n[|/3I>i]n[|/3</.|>i] 



h{(3)<p{uj)v{d(3duj), 
Af,n[|/3|>i]n[|/3</.|>i] 



Einl <2 / / \h{l5)(t){uj)\v{d(3 duj) 



<2co / / \(l){u:)\u{dl3 duj) 

<2co [ [ {1 A I3^)\(j){uj)\i^{dl3 duj) 

< 2cqBc < 00 

so 14 — )• in Li as e — )• 0, completing the proof of Theorem 2. 

APPENDIX B: REVERSIBLE-JUMP MCMC PROCEDURES 

A typical RJ-MCMC procedure for sampling varying-dimensional param- 
eters involves at least three types of moves (Birth, Death and Update); we 
use Metropolis-Hastings steps for each of these. Our trans-dimensional up- 
date steps entail altering the value (/3*,a;*) of one point {f3j,ujj). We select 
J ~ Un(0: J — 1) for proposed updating, then take Gaussian random walk 
steps successively in the coefficient /3j, the location parameter Xj^ ^'^^ fhe 
log kernel shape parameter, logAj. Step sizes are chosen to achieve approxi- 
mately 30% acceptance rates for each class of updates. One novel feature is 
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that when the proposed update of some coefficient /3j falls in the truncated 
region f]*7] E (— the move is treated as a Death, the point [f5j,bjj) is 
removed and J is decremented. This is advantageous as it automatically 
focuses on small magnitude coefficients for removal (rather than a random 
selection as in the typical RJ-MCMC Death step). A Birth step entails 
generating a new point (/3*,a;*) to be included among the {(/3j,a;j)} and 
incrementing J by one. We use a double exponential birth distribution with 
rate ry/e, conditioned to exceed \j3j\rj > e so that proposed coefficients are 
small, balancing the "Death" of small coefficients in the Update step to at- 
tain the target acceptance rates. The fixed-dimensional parameters are sam- 
pled using a conventional Metropolis-Hastings approach [Gilks, Richardson 
and Spiegelhalter (1996), Section 1.3.3]. Each of these inexpensive update 
steps requires only 0{n) operations [in contrast to Gaussian methods, which 
may require 0(n^)], so the method scales well in the number n of observa- 
tions. Further details of the RJ-MCMC are available in [Tu (2006), Ap- 
pendix A.l, pages 116 and 117]. An R package [R Development Core Team 
(2004)] implementing LARK is under development by the authors and will 
be made publicly available. 

APPENDIX C: EXAMPLES OF LARK PRIOR REALIZATIONS 



Symmetric Gamma LARK Sample Path with Blocl^s l^ernel Gamma LARK Sample Path with Bumps kernel 




Fig. 5. Four realizations from, LARK prior distribution with (a) Blocks kernel and Sym- 
metric Gamma Levy measure; (b) Bumps kernel and Gamma Levy measure; (c), (d) 
Doppler kernel and Cauchy Levy measure, with J = 1000 for (a)-(c) and J = 10 for (d) 
components. Hyperparameters a\, b\, a-y, 6-y, a^, br, and e are given in Table 2. 
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Cauchy LARK Sample Path with Doppler kernel Cauchy LARK Sample Path with Doppler kernel 




4 6 



(c) 




10 2 4 6 



(d) 



Fig. 5. (Continued.) 
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